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Abstract 

Independent Component Analysis (ICA) is a popular model for blind signal separation. The 
ICA model assumes that a number of independent source signals are linearly mixed to form the 
observed signals. We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), 
for provable model recovery for IGA with Gaussian noise. The main technical innovation of the 
algorithm is to use a fixed point iteration in a pseudo-Euclidean (indefinite “inner product”) 
space. The use of this indefinite “inner product” resolves technical issues common to several 
existing algorithms for noisy IGA. This leads to an algorithm which is conceptually simple, 
efficient and accurate in testing. 

Our second contribution is combining PEGI with the analysis of objectives for optimal re¬ 
covery in the noisy IGA model. It has been observed that the direct approach of demixing with 
the inverse of the mixing matrix is suboptimal for signal recovery in terms of the natural Signal 
to Interference plus Noise Ratio (SINK) criterion. There have been several partial solutions 
proposed in the IGA literature. It turns out that any solution to the mixing matrix reconstruc¬ 
tion problem can be used to construct an SINR-optimal IGA demixing, despite the fact that 
SINR itself cannot be computed from data. That allows us to obtain a practical and provably 
SINR-optimal recovery method for ICA with arbitrary Gaussian noise. 


1 Introduction 

Independent Component Analysis refers to a class of methods aiming at recovering statistically 
independent signals by observing their unknown linear combination. There is an extensive literature 
on this and a number of related problems [7] . 

Specifically, in the ICA model, we observe n-dimensional realizations x(l),..., x(V) of a latent 
variable model X = Sk-^k = where Ak denotes the column of the n x m mixing 

matrix A and S = {Si,... is the unseen latent random vector of “signals”. It is assumed 

that Si,..., Sm are independent and non-Gaussian. The source signals and entries of A may be 
either real- or complex-valued. For simplicity, we will assume throughout that S has zero mean, as 
this may be achieved in practice by centering the observed data. 

Many ICA algorithms use the preprocessing “whitening” step whose goal is to orthogonalize the 
independent components. In the noiseless, case this is commonly done by computing the square 
root of the covariance matrix of X. Consider now the noisy ICA model X = AS+r/ with additive 0- 
mean noise r] independent of S. It turns out that the introduction of noise makes accurate recovery 
of the signals significantly more involved. Specifically, whitening using the covariance matrix does 
not work in the noisy ICA model as the covariance matrix combines both signal and noise. For 
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the case when the noise is Gaussian, matrices constructed from higher order statistics (specifically, 
cumulants) can be used instead of the covariance matrix. However, these matrices are not in general 
positive definite and thus the square root cannot always be extracted. This limits the applicability 
of several previous methods, such as The GI-ICA algorithm proposed in |2T] addresses 

this issue by using a complicated quasi-orthogonalization step followed by an iterative method. 

In this paper (section [^, we develop a simple and practical one-step algorithm, PEGI (for 
pseudo-Euclidean Gradient Iteration) for provably recovering A (up to the unavoidable ambiguities 
of the model) in the case when the noise is Gaussian (with an arbitrary, unknown covariance 
matrix). The main technical innovation of our approach is to formulate the recovery problem as a 
fixed point method in an indefinite “inner product” (pseudo-Euclidean) space. 

The second contribution of the paper is combining PEGI with the analysis of objectives for 
optimal recovery in the noisy IGA model. In most applications of IGA (e.g., speech separation |18j . 
MEG/EEG artifact removal [20] and others) one cares about recovering the signals s(l),. .. ,s(N). 
This is known as the source recovery problem. This is typically done by first recovering the matrix 
A (up to an appropriate scaling of the column directions). 

At first, source recovery and recovering the mixing matrix A appear to be essentially equivalent. 
In the noiseless IGA model, if A in invertibl^ then s(t) = A“^x(t) recovers the sources. On the 
other hand, in the noisy model, the exact recovery of the latent sources s(t) becomes impossible 
even if A is known exactly. Part of the “noise” can be incorporated into the “signal” preserving 
the form of the model. Even worse, neither A nor S are defined uniquely as there is an inherent 
ambiguity in the setting. There could be many equivalent decompositions of the observed signal as 
X = A'S' -|- ? 7 ' (see the discussion in section]^. 

We consider recovered signals of the form S{B) := HX for a choice of m x n demixing matrix B. 
Signal recovery is considered optimal if the coordinates of S{B) = {Si{B ),..., SmiB)) maximize 
Signal to Interference plus Noise Ratio (SINK) within any fixed model X = AS -|- rj. Note that 
the value of SINK depends on the decomposition of the observed data into “noise” and “signal”: 
X = A'S' + T]'. 

Surprisingly, the SINK optimal demixing matrix does not depend on the decomposition of 
data into signal plus noise. As such, SINK optimal ICA recovery is well dehned given access to 
data despite the inherent ambiguity in the model. Further, it will be seen that the SINK optimal 
demixing can be constructed from cov(X) and the directions of the columns of A (which are also 
well-defined across signal/noise decompositions). 

Our SINR-optimal demixing approach combined with the PEGI algorithm provides a complete 
SINR-optimal recovery algorithm in the ICA model with arbitrary Gaussian noise. We note that the 
ICA papers of which we are aware that discuss optimal demixing do not observe that SINR optimal 
demixing is invariant to the choice of signal/noise decomposition. Instead, they propose more 
limited strategies for improving the demixing quality within a fixed ICA model. For instance, Joho 
et ah [131 show how SINR-optimal demixing can be approximated with extra sensors when assuming 
a white additive noise, and Koldovsky and Tichavsky m discuss how to achieve asymptotically 
low bias ICA demixing assuming white noise within a fixed ICA model. However, the invariance 
of the SINR-optimal demixing matrix appears in the array sensor systems literature [Hj. 

Finally, in section [^ we demonstrate experimentally that our proposed algorithm for ICA out¬ 
performs existing practical algorithms at the task of noisy signal recovery, including those specif¬ 
ically designed for beamforming, when given sufficiently many samples. Moreover, most existing 
practical algorithms for noisy source recovery have a bias and cannot recover the optimal demixing 
matrix even with infinite samples. We also show that PEGI requires significantly fewer samples 

^ A~^ can be replaced with in the discussion below for over-determined ICA. 
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than GI-ICA [2T] to perform ICA accurately. 

1.1 The Indeterminacies of ICA 

Notation: We use M* to denote the entry-wise complex conjugate of a matrix M , to denote 
its transpose, and to denote its conjugate transpose. 

Before proceeding with our results, we discuss the somewhat subtle issue of indeterminacies in 
ICA. These ambiguities arise from the fact that the observed X may have multiple decompositions 
into ICA models X = AS -|- r] and X = A'S' + rj'. 

Noise-free ICA has two natural indeterminacies. For any nonzero constant a, the contribution 
of the component Aj^S^ to the model can equivalently be obtained by replacing A^ with and 
Sk with the rescaled signal ^5'fc. To lessen this scaling indeterminacy, we use the conventioij^ that 
cov(S) = I throughout this paper. As such, each source (or equivalently each A^) is defined up 
to a choice of sign (a unit modulus factor in the complex case). In addition, there is an ambiguity 
in the order of the latent signals. For any permutation tt of [m] (where [m] m}), the ICA 

models X = ^k-^k and X = SiT{k)-^TT(k) are indistinguishable. In the noise free setting, 

A is said to be recovered if we recover each column of A up to a choice of sign (or up to a unit 
modulus factor in the complex case) and an unknown permutation. As the sources Si,..., Sm are 
only defined up to the same indeterminacies, inverting the recovered matrix A to obtain a demixing 
matrix works for signal recovery. 

In the noisy ICA setting, there is an additional indeterminacy in the definition of the sources. 
Consider ^ to be 0-mean axis-aligned Caussian random vector. Then, the noisy ICA model X = 
A(S -|- ^) -|- 77 in which ^ is considered part of the latent source signal S' = S and the model 
X = AS -|- (A^ -|- r]) in which ^ is part of the noise are indistinguishable. In particular, the 
latent source S and its covariance are ill-defined. Due to this extra indeterminacy, the lengths of 
the columns of A no longer have a fully defined meaning even when we assume cov(S) = X. In 
the noisy setting, A is said to be recovered if we obtain the columns of A up to non-zero scalar 
multiplicative factors and an arbitrary permutation. 

The last indeterminacy is the most troubling as it suggests that the power of each source signal 
is itself ill-defined in the noisy setting. Despite this indeterminacy, it is possible to perform an 
SINR-optimal demixing without additional assumptions about what portion of the signal is source 
and what portion is noise. In section we will see that SINR-optimal source recovery takes on a 
simple form: Given any solution A which recovers A up to the inherent ambiguities of noisy ICA, 
then A^ cov(X)^^ is an SINR-optimal demixing matrix. 

1.2 Related Work and Contributions 

Independent Component Analysis is probably the most used model for Blind Signal Separation. 
It has seen numerous applications and has generated a vast literature, including in the noisy and 
underdetermined settings. We refer the reader to the books HaE! for a broad overview of the 
subject. 

It was observed early on by Cardoso [3| that ICA algorithms based soley on higher order cumu- 
lant statistics are invariant to additive Gaussian noise. This observation has allowed the creation 
of many algorithms for recovering the ICA mixing matrix in the noisy and often underdetermined 
settings. Despite the significant work on noisy ICA algorithms, they remain less efficient, more 
specialized, or less practical than the most popular noise free ICA algorithms. 

^Alternatively, one may place the scaling information in the signals by setting ||Afc|| = 1 for each k. 


3 



Research on cumulant-based noisy ICA can largely be split into several lines of work which we 
only highlight here. Some algorithms such as FOOBI [3] and BIOME [T] directly use the tensor 
structure of higher order cumulants. In another line of work, De Lathauwer et al. [8] and Yeredor 
|23j have suggested algorithms which jointly diagonalize cumulant matrices in a manner reminiscent 
of the noise-free JADE algorithm [3]. In addition, Yeredor |22j and Goyal et al. m have proposed 
ICA algorithms based on random directional derivatives of the second characteristic function. 

Each line of work has its advantages and disadvantages. The joint diagonalization algorithms 
and the tensor based algorithms tend to be practical in the sense that they use redundant cumu¬ 
lant information in order to achieve more accurate results. However, they have a higher memory 
complexity than popular noise free ICA algorithms such as EastICA m- Moreover, the tensor 
methods (EOOBI and BIOME) require the latent source signals to have positive order 2k {k > 2, 
a predetermined fixed integer) cumulants as they rely on taking a matrix square root. Einally, the 
methods based on random directional derivatives of the second characteristic function rely heavily 
upon randomness in a manner not required by the most popular noise free ICA algorithms. 

We continue a line of research started by Arora et al. [2] and Voss et al. [2T] on fully determined 
noisy ICA which addresses some of these practical issues by using a deflationary approach remi¬ 
niscent of EastICA. Their algorithms thus have lower memory complexity and are more scalable 
to high dimensional data than the joint diagonalization and tensor methods. However, both works 
require a preprocessing step (quasi-orthogonalization) to orthogonalize the latent signals which is 
based on taking a matrix square root. Arora et al. [2] require each latent signal to have positive 
fourth cumulant in order to carry out this preprocessing step. In contrast, Voss et al. m are 
able to perform quasi-orthogonalization with source signals of mixed sign fourth cumulants; but 
their quase-orthogonalization step is more complicated and can run into numerical issues under 
sampling error. We demonstrate that quasi-orthogonalization is unnecessary. We introduce the 
PECI algorithm to work within a (not necessarily positive definite) inner product space instead. 
Experimentally, this leads to improved demixing performance. In addition, we handle the case of 
complex signals. 

Einally, another line of work attempts to perform SINR-optimal source recovery in the noisy 
ICA setting. It was noted by Koldovsky and Tichavsky m that for noisy ICA, traditional ICA 
algorithms such as EastICA and JADE actually outperform algorithms which first recover A in 
the noisy setting and then use the resulting approximation of to perform demixing. It was 
further observed that is not the optimal demixing matrix for source recovery. Later, Koldovsky 
and Tichavsky HZ! proposed an algorithm based on EastICA which performs a low SINR-bias 
beamforming. 


2 Pseudo-Euclidean Gradient Iteration ICA 


In this section, we introduce the PEGI algorithm for recovering A in the “fully determined” noisy 
ICA setting where m < n. PEGI relies on the idea of Gradient Iteration introduced Voss et al. |21j . 
However, unlike GI-ICA Voss et al. |21j . PEGI does not require the source signals to be orthog- 
onalized. As such, PEGI does not require the complicated quasi-orthogonalization preprocessing 
step of GI-ICA which can be inaccurate to compute in practice. We sketch the Gradient Iteration 
algorithm in Section 2.1 and then introduce PEGI in Section 2.2 Eor simplicity, we limit this 


discussion to the case of real-valued signals. We show how to construct PEGI for complex-valued 
signals in Appendix [A) 

In this section we assume a noisy ICA model X = AS -|- r] such that rj is arbitrary Gaussian 
and independent of S. We also assume that m < n, that m is known, and that the columns of A 
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are linearly independent. 


2.1 Gradient Iteration with Orthogonality 

The gradient iteration relies on the properties of cumulants. We will focus on the fourth cumulant, 
though similar constructions may be given using other even order cumulants of higher order. For 
a zero-mean random variable X, the fourth order cumulant may be defined as ka{X) := E[X^] — 
3E[X^]^ [see [71 Chapter 5, Section 1.2]. Higher order cumulants have nice algebraic properties 
which make them useful for ICA. In particular, m has the following properties: (1) (Independence) 
If X and Y are independent, then K 4 (X -|-T) = Ki{X) + K 4 {Y). (2) (Homogeneity) If a is a scalar, 
then K 4 (aX) = a‘^Ki{X). (3) (Vanishing Gaussians) If X is normally distributed then K 4 (V) = 0. 

We consider the following function defined on the unit sphere: /(u) := K 4 ((X, u)). Expanding 
/(u) using the above properties we obtain: 

m m 

/(u) = «;4(^^(Afc,u)S’fc (u,? 7 )^ = J2i^k,u)^K4{Sk) ■ 

k=l k=l 


Taking derivatives we obtain: 


m 

V/(u) = 4 u)3K4(5fc)Afc (1) 

k=l 

m 

= 12 ufK4{Sk)AkAl = AD{u)A^ (2) 

k=l 

where D{u) is a diagonal matrix with entries D{u)kk = 12(Afc, u)^K4(5fc). 

Voss et al. [21] introduced GI-ICA as a fixed point algorithm under the assumption that the 
columns of A are orthogonal but not necessarily unit vectors. The main idea is that the update 
V/(u)/||V/(u)|| is a form of a generalized power iteration. From equation ([^, each Ak may 
be considered as a direction in a hidden orthogonal basis of the space. During each iteration, the 
Ak coordinate of u is raised to the 3’’^ power and multiplied by a constant. Treating this iteration 
as a fixed point update, it was shown that given a random starting point, this iterative procedure 
converges rapidly to one of the columns of A (up to a choice of sign). The rate of convergence is 
cubic. 

However, the GI-IGA algorithm requires a somewhat complicated preprocessing step called 
quasi-orthogonalization to linearly transform the data to make columns of A orthogonal. Quasi- 
orthogonalization makes use of evaluations of Hessians of the fourth cumulant function to construct 
a matrix of the form C = ADA^ where D has all positive diagonal entries—a task which is com¬ 
plicated by the possibility that the latent signals Si may have fourth order cumulants of differing 
signs—and requires taking the matrix square root of a positive definite matrix of this form. How¬ 
ever, the algorithm used for constructing C under sampling error is not always positive definite in 
practice, which can make the preprocessing step fail. We will show how our PEGI algorithm makes 
quasi-orthogonalization unnecessary, in particular, resolving this issue. 

2.2 Gradient Iteration in a Psendo-Enclidean Space 

We now show that the gradient iteration can be performed using in a pseudo-Euclidean space in 
which the columns of A are orthogonal. The natural candidate for the “inner product space” 
would be to use (•,•)* defined as (u, v)* := u’^(AA'^)'^v. Glearly, {Ai,Aj)^ = 5ij gives the desired 
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Algorithm 1 Recovers a column of A up to a scaling factor if uq is generically chosen. 
Inputs: Unit vector uq, C, V/ 
k ^ I 

repeat 

Ufc ^ V/(C'tufc_i)/||V/(CWi)|| 
k k + l 

until Convergence (up to sign) 

return Ufc 


orthogonality property. However, there are two issues with this “inner product space”: First, it 
is only an inner product space when A is invertible. This turns out not to be a major issue, and 
we move forward largely ignoring this point. The second issue is more fundamental: We only 
have access to AA^ in the noise free setting where cov(X) = AA^ . In the noisy setting, we 
have access to matrices of the form = AZ)(u)A^ from equation Q instead, consider a 

pseudo-Euclidean inner product defined as follows: Let C = ADA^ where D is a diagonal matrix 
with non-zero diagonal entries, and define (•, ■)c by (u, v )(7 = vAcA. When D contains negative 
entries, this is not a proper inner product since C is not positive definite. In particular, (A^, Ak)c = 
A^{ADAA)'^Ak = may be negative. Nevertheless, when k A Ji (Afc, Aj)c = A^{ADAA)'^Aj = 0 
gives that the columns of A are orthogonal in this space. 

We define functions a/c : M”" —)• M by afc(u) = (A'^u)fc such that for any u G span(Ai,..., A^), 
then u = is the expansion of u in its A, basis. Continuing from equation 0 , for 

any u G 5”“^ we see Vf{CA) = 4YAk=i{^k,CA)^K4{Sk)Ak = 4:Y2=i{^k,A)%K‘ASk)Ak is the 
gradient iteration recast in the (•, ■)c space. Expanding u in its A^ basis, we obtain 

m m 

VfiCA) = 4^(afc(u)(Afc, Afc)c)"K4(5fc)Afc = 4Y,ak{nf{d^^KASk))Ak , (3) 

k=l k=l 

which is a power iteration in the unseen Ak coordinate system. As no assumptions are made 
upon the K^Sk) values, the scalings which were not present in eq. Q cause no issues. Using 
this update, we obtain Alg. a fixed point method for recovering a single column of A up to an 
unknown scaling. 

Before proceeding, we should clarify the notion of fixed point convergence in Algorithm We 
say that the sequence {u^j^g converges to v up to sign if there exists a sequence {c^j^g such 
that each Ck G {±1} and —)■ v as A: —>• oo. We have the following convergence guarantee. 

Theorem 1. If uq is chosen uniformly at random from S'^~^, then with probability 1, there exists 
I G [m] such that the sequence {u^j^g defined as in Algorithm^ converges to A^/HA^H up to sign. 
Further, the rate of convergence is cubic. 

Due to limited space, we omit the proof of TheoremIt is similar to the proof of IZD Theorem 

4 ]- 

In practice, we test near convergence by checking if we are still making significant progress. 
In particular, for some predefined e > 0, if there exists a sign value Ck G {±1} such that ||ufc — 
CfcUfc_i|| < e, then we declare convergence achieved and return the result. As there are only two 
choices for Ck, this is easily checked, and we exit the loop if this condition is met. 

Pull ICA Recovery Via the Pseudo-Euclidean Gl-Update. We are able to recover a single 
column of A up to its unknown scale. However, for full recovery of A, we would like (given recovered 
columns A^^, • • •, Ai.) to be able to recover a column Ak such that k 0 {Ii,... Aj} demand. 
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Algorithm 2 Full ICA matrix recovery algorithm. Returns two matrices: (1) A is the recovered 
mixing matrix for the noisy ICA model X = AS + r/, and (2) R is a running estimate of A'^. 

1: Inputs: C, V/ 

2 : A ^ 0, B ^ 0 
3: for j 1 to m do 

4 : Draw u uniformly at random from S^~^. 

5: repeat 

6: u ■<— u — ABw 

7: u^V/(Ctu)/||V/(Ctu)||. 

8: until Convergence (up to sign) 

9: -Aj i — U 

10 : B,. ^ \C'Ail({CflAifAj)Y 

11: end for 

12: return A, B 


The idea behind the simultaneous recovery of all columns of A is two-fold. First, instead of 
just finding columns of A using Algorithm we simultaneously find rows of . Then, using 
the recovered columns of A and rows of A\ we project u onto the orthogonal complement of the 
recovered columns of A within the (•, ■)c pseudo-inner product space. 

Recoverir^ rows of . Suppose we have access to a column A^ (which may be achieved using 
Algorithm [r. Let denote the row of A^^. Then, we note that C^A^ = {ADA^)^Aj. = 

recovers A|, up to an arbitrary, unknown constant However, the 

constant may be recovered by noting that {Ak,Ak)c = (C'^A^j^A^ = d'J^^. As such, we may 
estimate A|,. as [C^ Ak/ {{C^ Ak)"^Ak)]'^. 

Enforcing Orthogonality During the GI Update. Given access to a vector u = ^2^=1 ^kiA^k + 
P^iU (where is the projection onto the orthogonal complements of the range of A), some 
recovered columns A£^,...,A£^, and corresponding rows of A^ we may zero out the compo¬ 
nents of u corresponding to the recovered columns of A. Letting u' = u — A^^. AJ , u, then 

u' = X]fce[m]\Ri ir) ^k{A)-^k + particular, u' is orthogonal (in the (•, -jc space) to the 

previously recovered columns of A. This allows the non-orthogonal gradient iteration algorithm to 
recover a new column of A. 

Using these ideas, we obtain Algorithmic which is the PEGI algorithm for recovery of the mix¬ 
ing matrix A in noisy IGA up to the inherent ambiguities of the problem. Within this Algorithm, 
step IC enforces orthogonality with previously found columns of A, guaranteeing that convergence 
to a new column of A. 

Practical Construction of C. In our implementation, we set C = ^ h can be 

shown from equation Q that = ADA^ with dkk = \\Ak\\'^K 4 {Sk)- This deterministi¬ 

cally guarantees that each latent signal has a significant contribution to C. 

3 SINK Optimal Recovery in Noisy ICA 

In this section, we demonstrate how to perform SINK optimal IGA within the noisy IGA framework 
given access to an algorithm (such as PEGI) to recover the directions of the columns of A. To this 
end, we first discuss the SINK optimal demixing solution within any decomposition of the IGA 
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model into signal and noise as X = AS + rj. We then demonstrate that the SINK optimal demixing 
matrix is actually the same across all possible model decompositions, and that it can be recovered. 
The results in this section hold in greater generality than in section They hold even if m > n 
(the underdetermined setting) and even if the additive noise r/ is non-Gaussian. 

Consider B an m x n demixing matrix, and define S{B) := SX the resulting approximation 
to S. It will also be convenient to estimate the source signal S one coordinate at a time: Given a 
row vector b, we define 5(b) := bX. If b = B^. (the row of B), then 5(b) = [8(13)]*; = Sk{B) 
is our estimate to the latent signal 5*,. Within a specific IGA model X = AS + r/, signal to 
intereference-plus-noise ratio (SINK) is defined by the following equation: 


SINRfc(b) := 


_var(bAfc5A;)_ 

var(bAS — bA*;5fc) + var(br 7 ) 


var(bAfc5fc) _ 

var(bAX) — var(bA*;5fc) 


(4) 


SINRfc is the variance of the contribution of k^^ source divided by the variance of the noise and 
interference contributions within the signal. 

Given access to the mixing matrix A, we dehne Ropt = A^{AA^ + cov(t 7))1. Since cov (X) = 
AA'^ + cov( 77), this may be rewritten as Ropt = A^cov(X)^. Here, cov(X)^^ may be estimated from 
data, but due to the ambiguities of the noisy IGA model, A (and specifically its column norms) 
cannot be estimated from data. 

Koldovsky and Tichavsky [15] observed that when rj is a white Gaussian noise, Ropt jointly 
maximizes SINR*, for each k G [m], i.e., SINR*, takes on its maximal value at (Ropt)fe.- Below in 
Proposition]^ we generalize this result to include arbitrary non-spherical, potentially non-Gaussian 
noise. 

It is interesting to note that even after the data is whitened, i.e. cov(A) = X, the optimal 
SINR solution is different from the optimal solution in the noiseless case unless A is an orthogonal 
matrix, i.e. A^ = A^. This is generally not the case, even if rj is white Gaussian noise. 


Proposition 2. For each k G [m\, (Ropt)fc. is a maximizer o/SINR*,. 

The proof of Proposition is deferred to appendix [B] 

Since SINR is scale invariant. Proposition implies that any matrix of the form RRopt = 
DA^ cov(X)l where R is a diagonal scaling matrix (with non-zero diagonal entries) is an SINR- 
optimal demixing matrix. More formally, we have the following result. 

Theorem 3. Let A be an n x m matrix containing the columns of A up to scale and an arbitrary 
permutation. That is, there exists a permutation vr of [m\ and non-zero constants ai, ..., am such 
that akA.,j[k) = for each k G [mj. Then, (A^ cov(A)'l)^(*.). is a maximizer o/SINR*,. 

By Theorem given access to a matrix A which recovers the directions of the columns of A, 
then A^ cov(A)l is the SINR-optimal demixing matrix. For IGA in the presence of Gaussian noise, 
the directions of the columns of A are well defined simply from X, that is, the directions of the 
columns of A do not depend on the decomposition of X into signal and noise (see the discussion in 
section o on IGA indeterminacies). The problem of SINR optimal demixing is thus well defined 
for IGA in the presence of Gaussian noise, and the SINR optimal demixing matrix can be estimated 
from data without any additional assumptions on the magnitude of the noise in the data. 

Finally, we note that in the noise-free case, the SINR-optimal source recovery simplifies to be 
At. 


Corollary 4. Suppose thafK = AS is a noise free (possibly underdetermined) ICA model. Suppose 
that A G contains the columns of A up to scale and permutation, i.e., there exists diagonal 

matrix D with non-zero entries and a permutation matrix H such that A = ARH. Then At is an 
SINR-optimal demixing matrix. 
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(b) Bias under additive Gaussian noise. 


Figure 1: SINR performance comparison of ICA algorithms. 


Proof. By Theorem]^ {AD ^n)^cov(X)l is an SINR-optimal demixing matrix. Expanding, we 
obtain: cov(X)t = D-^{AA^= U^D-^A^ = (ADU)^ = Al □ 

Corollary 1^ is consistent with known beamforming results. In particular, it is known that A^ is 
an optimal (in terms of minimum mean squared error) beamforming matrix for underdetermined 
ICA [ini section 3B]. 


4 Experimental Results 

We now compare the proposed PEGI algorithm with several existing ICA algorithms. In addition 
to qorth+GI-ICA (that is, GI-ICA with the quasi-orthogonalization preprocessing step), we use 
the following algorithmic baselines: 

JADE [3] is a popular fourth cumulant based ICA algorithm designed for the noise free setting. 
We use the implementation of Cardoso and Souloumiac [5]. 

FastICA [12] is a popular ICA algorithm designed for the noise free setting based on a deflationary 
approach of recovering one component at a time. We use the implementation of Gavert et ah [lOj . 
iFicA dnunj is a variation of FastICA with the tanh contrast function designed to have low bias 
for performing SINR-optimal beamforming in the presence of Gaussian noise. 

Ainv is the oracle demixing algorithm which uses A’^ as the demixing matrix. 

SINR-opt is the oracle demixing algorithm which demixes using A^ cov(X)^^ to achieve an SINR- 
optimal demixing. 

We compare these algorithms on simulated data with n = m. We constructed mixing matrices 
A with condition number 3 via a reverse singular value decomposition {A = UAV^). The matrices 
U and V were random orthogonal matrices, and A was chosen to have 1 as its minimum and 3 as 
its maximum singular values, with the intermediate singular values chosen uniformly at random. 
We drew data from a noisy ICA model X = AS -|- rj where cov(r/) = S was chosen to be malaligned 
with cov(AS) = AA^. We set S = p(10X — AA^) where p is a constant defining the noise power. 
It can be shown that p = is the ratio of the maximum directional noise variance 

to the maximum directional signal variance. We generated 100 matrices A for our experiments 
with 100 corresponding ICA data sets for each sample size and noise power. When reporting 
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SINR Comparison 

(d=14, Noise Power=0.67, mixed distr) 
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Figure 2: Accuracy comparison of PEGI using pseudo-inner product spaces and GI-ICA using 
quasi-orthogonalization. 


results, we apply each algorithm to each of the 100 data sets for the corresponding sample size 
and noise power and we report the mean performance. The source distributions used in our IGA 
experiments were the Laplace and Bernoulli distribution with parameters 0.05 and 0.5 respectively, 
the t-distribution with 3 and 5 degrees of freedom respectively, the exponential distribution, and the 
uniform distribution. Each distribution was normalized to have unit variance, and the distributions 
were each used twice to create 14-dimensional data. We compare the algorithms using either SINR 
or the SINR loss from the optimal demixing matrix (defined by SINR Loss = [Optimal SINR — 
Achieved SINR]). 

In Eigure [Tbl we compare our proprosed IGA algorithm with various IGA algorithms for signal 
recovery. In the PEGI-K 4 -I-SINR algorithm, we use PEGI-K 4 to estimate A, and then perform 
demixing using the resulting estimate of cov(X)~^, the formula for SINR-optimal demixing. It 
is apparent that when given sufficient samples, PEGI-K 4 -I-SINR provides the best SINR demixing. 
JADE, EastIGA-tanh, and lEICA each have a bias in the presence of additive Gaussian noise which 
keeps them from being SINR-optimal even when given many samples. 

n Eigure la, we compare algorithms at various sample sizes. The PEGI-K 4 -I-SINR algorithm 
relies more heavily on accurate estimates of fourth order statistics than JADE, and the EastlCA- 
tanh and lEICA algorithms do not require the estimation of fourth order statistics. Eor this reason, 
PEGI-K 4 -I-SINR requires more samples than the other algorithms in order to be run accurately. 
However, once sufficient samples are taken, PEGI-K 4 -I-SINR outperforms the other algorithms in¬ 
cluding lEICA which is designed to have low SINR bias. 

In order to avoid clutter, we did not include qorth-|-GI-ICA-K 4 -|-SINR (the SINR optimal demix¬ 
ing estimate constructed using qorth-|-GI-ICA-K 4 to estimate A) in the figures [Tb| and la It is also 
assymptotically unbiased in estimating the directions of the columns of A, and similar conclu¬ 
sions could be drawn using qorth-|-GI-ICA-K 4 in place of PEGI-K 4 . However, in Eigure we 
see that PEGI-K 4 -I-SINR requires fewer samples than qorth-|-GI-ICA-K 4 -|-SINR to achieve good 
performance. This is particularly highlighted in the medium sample regime. 


On the Performance of Traditional ICA Algorithms for Noisy ICA. An interesting 
observation [first made in [15] is that the popular noise free ICA algorithms JADE and EastICA 
perform reasonably well in the noisy setting. In Eigures [Tb| and |Ta| they signihcantly outperform 
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demixing using A~^ for source recovery. It turns out that this may be explained by a shared 
preprocessing step. Both JADE and FastICA rely on a whitening preprocessing step in which 
the data are linearly transformed to have identity covariance. It can be shown in the noise free 
setting that after whitening, the mixing matrix A is a rotation matrix. These algorithms proceed 
by recovering an orthogonal matrix A to approximate the true mixing matrix A. Demixing is 
performed using A~^ = A^. Since the data is white (has identity covariance), then the demixing 
matrix A^ = A^cov(X)“^ is an estimate of the SINR-optimal demixing matrix. Nevertheless, the 
traditional ICA algorithms give a biased estimate of A under additive Gaussian noise. 
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A PEGI for Complex Signals 

In Section we showed how to perform gradient iteration IGA within a pseudo-Euclidean inner 
product space. In this appendix, we show how this PEGI algorithm can be extended to include 
complex valued signals. For clarity, we repeat the entire PEGI algorithmic construction from 
Section with the necessary modifications to handle the complex setting. 

Throughout this appendix, we assume a noisy ICA model X = AS -|- r) where ij is an arbitrary 
Gaussian noise independent of S. We also assume that m < n, that m is known, and that the 
columns of A are linearly dependent. 
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A.l Fourth Cumulants of Complex Variables 

The gradient iteration relies on the properties of cumulants. We will focus on the fourth cumulant, 
though similar constructions may be given using other even order cumulants of higher order. We will 
use two versions of the fourth cumulant which capture slightly different fourth order information. 
For a zero-mean random variable X, they may be defined as Ki{X) := E[X^] — 3E[A^]^ and 
k\{X) := K[X‘^X*'^] — 2E[W*]^ — E[A^]E[A*^]. For real random variables, these two definitions 
are equivalent, and they come from two different conjugation schemes when constructing the fourth 
order cumulant [see El Chapter 5, Section 1.2]. However, in general, only is guaranteed to be 
real valued. The higher order cumulants have nice algebraic properties which make them useful for 
ICA: 

1. (Independence) If X and Y are independent random variables, then k/i{X -|- T) = k,a{X) + 
K 4 (F) and kI{X + Y) = k^{X + Y). 

2. (Homogeneity) If a is a scalar, then K 4 (aA) = a'^K^^X) and K\{aX) = \a\^K\{X). 

3. (Vanishing Gaussians) If X is normally distributed then K 4 {X) = 0 and ii^{X) = 0. 

In this appendix, we consider a noisy ICA model X = AS -|- rj where ry is a 0-mean (possibly 
complex) Gaussian and independent of S. We consider the following functions defined on the unit 
sphere: /(u) := «: 4 ((X,u)) and /*(u) := ^^((X,u)). Then, expanding using the above properties 
we obtain: 


/(u) = kA ^(Afc,u)S’fc -k (u,?7) j 

k=l ^ 

m 

= y~](Afc,u)^K4(5'fc) 
k=l 

Using similar reasoning, it can be seen that /*(u) = YAk=i u)|^K:4(5fc). 

It turns out that some slightly non-standard notions of derivatives are most useful in construct¬ 
ing the gradient iteration in the complex setting. We use real derivatives for the gradient and we 
use the complex Hessian. In particular, expanding + iyk: we use the gradient operator 

V := We make use of the operators Oufc := and dul := ^(gf^+igf^) to 

define Ti := YA=i^k^J^'akdu'j. Applying this version of the Hessian is different than using 

real derivatives as in the gradient operation. 

Taking derivatives, we obtain: 


V/(u) = 4^(Afc,u)^K4(5fc)Afc 

k=l 


(5) 


H/*(u) = 4Y,\{Ak,n)\\liSk)AlAl 
k=l 

= A*D{u)A^ (6) 

where D{u) is a diagonal matrix with entries D{u)kk = 4|(Afc, u)|^K 4 (S'fc). 
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Algorithm 3 Recovers a column of A up to an unknown scaling factor when uq is generically 
chosen._ 

Inputs: uo (A unit vector), C, V/ 
k <— 1 

repeat 

Ufc ^ V/(C't*Ufc_i)/||V/(Ct*Ufc_i)|| 

k <— k + 1 

until Convergence (up to a unit modulus factor) 

return u^ 


A.2 Gradient Iteration in a Pseudo-Euclidean Space 

We now demonstrate that the gradient iteration can be performed using a generalized notion of an 
inner product space in which the columns of A are orthogonal. The natural candidate for the “inner 
product space” would be to use (■,■}* defined as (u, v)* ;= {A*AJ')'^'v*. Clearly, = Sij 

gives the desired orthogonality property. However, there are two issues with this “inner product 
space”: First, it is only an inner product space when A is non-singular (invertible). This turns 
out not to be a major issue, and we will move forward largely ignoring this point. The second 
issue is more fundamental: We only have access to the matrix A*A'^ in the noise free setting 
where cov(X)^ = (AA^)"^ = A* . In the noisy setting, we have access to matrices of the form 
'R/*(u) = A*D{\i)A^ from equation Q instead. 

We consider a pseudo-Euclidean inner product defined as follows: Let C = A*DA^ where D is 
a diagonal matrix with non-zero diagonal entries, and define (•, ■)c by (u, v)c = u^CW*. When 
D contains negative entries, this is not a proper inner product since C is not positive definite. 
In particular, {Ak,Ak)c = AliA-DA^Ml = may be negative. Nevertheless, when k ^ j, 
{Af^, Aj)c = A^{A*DA^)"^A*- = 0 gives that the columns of A are orthogonal in this space. 

We define functions oa, : C"" —>■ C by aA;(u) = (Afu)^ such that for any u G span(Ai,..., Am), 
then u = 1 ai{\i)Ai is the expansion of u in its Ai basis. Continuing from equation ([^, for any 

u G 5”“^ we see 


V/(Ct*u) = 4^(Afc,Ct*u)3AC4(5fc)AA, 

k=l 

n 

= 4:'^{Ak,u)l;Ki{Sk)Ak 

k=l 

is the gradient iteration recast in the (•, ■)c space. Expanding u in its Ak basis, we obtain 

m 

V/(Ct*u) = 4^(aA(u)(Afc, AA,)c)"K4(Sfc)Afc 

k=l 

m 

= 4^Q:fc(u)^(d“^^K4(5fc))AA; , (7) 

fc=i 

which is a power iteration in the unseen A^ coordinate system. As no assumptions are made upon 
the K 4 (S'fc) values, the scalings which were not present in equation Q cause no issues. Using 
this update, we obtain Algorithm a fixed point method for recovering a single column of A up 
to an unknown scaling. 
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Before proceeding, we should clarify the notion of fixed point convergence in Algorithm We 
say that the sequence {u^j^g converges to v up to a unit modulus factor if there exists a sequence 
of constants {cfcj^g such that each \ck\ = 1 and CfcUfc — v as A: —)■ oo. We have the following 
convergence guarantee. 

Theorem 5. /fuo is chosen uniformly at random from 5'”'“^. Then with probability 1, there exists 
I G [m] such that the sequence {wj^g defined as in Algorithm^ converges to a up to a 

unit modulus factor. Further, the rate of convergence is cubic. 

Due to space limitations, we omit the proof of Theorem However, its proof is very similar 
that of an analogous result for the GI-ICA algorithm |21l Theorem 4], 

In practice, we test near convergence by testing if we are still making significant progress. 
In particular, for some predefined e > 0, if there exists a unit modulus constant Ck such that 
||ufc — CfcUfc_i|| < e, then we declare convergence achieved and return the result. We may determine 
Ck using the following fact. 

Fact 6. Suppose that u and v are non-orthogonal unit modulus vectors. The expression ||u — e*^v|| 
is minimized by the choice of 9 = atan2(Im((u, v)), Re((u, v))). 

Letting 9 = atan2(Im((ufc, Ufc_i)), Re((ufc, Ufc_i)), we exit the loop if ||ufc — e®®Ufc_i|| < e. 

A.3 Full ICA Recovery Via the Pseudo-Euclidean Gl-Update 

We are able to recover a single column of A in noisy ICA. However, for full matrix recovery, we 
would like (given recovered columns A^.^,..., A^.) to be able to recover a column Ak such that 
k 0 {£i ,... ,ij} on demand. 

The main idea behind the simultaneous recovery of all columns of A is two-fold. First, instead 
of just finding columns of A using Algorithm we simultaneously find rows of A^. Then, using 
the recovered columns of A and rows of A\ we may project u onto the orthogonal complement of 
the recovered columns of A within the (•, •)c pseudo-Euclidean inner product space. 

Recovering rows of A^. Suppose we have access to a column Ak (which may be achieved using 
Algorithm]^. Let denote the row of A^. Then, we note that C^A*f. = {A*DA'^)^Al = 
~ recovers A|,. up to an arbitrary, unknown constant d^^. However, the 

constant df^ may be recovered by noting that {Ak,Ak)c = iC^Ak)"^Ak = df^. As such, we may 
estimate as [C^Ak/{{C^Ak)"^Ak)]"^. 

Enforcing Orthogonality During the GI Update. Given access to u = Yl'k=i^k{^)Ak + 
Pa±u, some recovered columns A^j,..., A^^, and corresponding rows of A^, we may zero out the 
components of u corresponding to the recovered columns of A. Letting u' = u — A^^. aJ , u, 

then u' = X]fce[m]\{£i particular, u' is orthogonal (in the {■,-)c space) 

to the previously recovered columns of A. This allows us to modify the non-orthogonal gradient 
iteration algorithm to recover a new column of A. 

Using these ideas, we obtain the Algorithm for recovery of the IGA mixing matrix. Within 
this Algorithm, step enforces orthogonality with previously found columns of A, guaranteeing 
that convergence is to a new column of A. 

Practical Construction of C We suggest the choice oiC = \ Yl'k=i it can be shown 

from equation ([^ that J2k=i'^f*i^k) = A*DA^ with dkk = \\Ak\\‘^K,\{Sk). This deterministically 
guarantees that each latent signal has a significant contribution to C. 
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Algorithm 4 Full ICA matrix recovery algorithm. Estimates and returns two matrices: (1) A is 
the recovered mixing matrix for the noisy ICA model X = AS + r/, and (2) B is a running estimate 
of jt._ 

1: Inputs: C, V/ 

2: A^O, B^O 

3: for jA— 1 to m do 

4: Draw u uniformly at random from 

5: repeat 

6: U U — Al.Bu 

7: u^V/(Ct*u)/||V/(Ct*u)||. 

8: until Convergence (up to a unit modulus factor) 

9: Aj i — U 

10 : Bj. ^ [C^Aj/{{C^AjfAj)f 

11: end for 
12: return A, B 


B Proof of Proposition 

Proof. This proof is based on the connection between two notions of optimality, minimum mean 

sqnared error and SINK. The mean squared error of the recovered signal -S(b) from latent signal 

^ 2 

is defined as MSEfc(b) := E[|5fc — 5'(b)| ]. It has been shown [U equation 39] that i?opt jointly 
minimizes the mean sqnared errors of the recovered signals. In particular, if b = {Bopt)k-, then b 
is a minimizer of MSEfc(b). 

We will first show that finding a matrix B which minimizes the mean squared error has the side 
effect of maximizing the magnitude of the Pearson correlations Sk{B) k G [m], where 

Po Q ■= We will then demonstrate that if H is a maximizer of jpe a then Bu. 

rSk,Sk(B) '^Sk,Sk(B)0 k 

is a maximizer of SINR^. These two facts imply the desired result. We will use the convention that 

PSkAiB) 1® 0 if ^Su(.B) = 0- 
We fix a /c G [mj. We have: 

MSEfc(b) = E[SkSl -2Re{SkS*{h)) + S(b)5*(b)] 

= 1 - R'e(Psfe,5(b)) + ii'l(b) • 

Letting w = sgn(p_g^ S(b))’ obtain 


PSk,S{‘^h) 


E[5fc5*(wb)] _ ,E[5fc5*(b)] 

- — id - 

^Sk^S{uih) ^Sk^^sih) 


I^SfcA(b) 


( 8 ) 


Enrther, MSEfc(a;b) = 1 — ‘^(i's{h)\PSk S(b)l '^l(b) — equality if and only if pg^ 

is real and non-negative. As such, all global minima of MSE^ are contained in the set A = {b | 
PSk S(b) ^ [f*’ f]}’ restrict onr investigation to this set. 

We define a fnnction g{x, y) := 1 — 2xy + y‘^ such that under the change of variable x(b) = 
and y(b) = pg^ we obtain MSEfc(b) = g{x,y). Let M = maxbe^p^^ and let yo £ [0,M] 

be fixed. Then, arg min 2 ,gRp(x, yo) = Vo with the resulting value g{yo,yo) = ^ — Vo- As such, the 
minimum of g{x, y) over the domain M x [0, M] occurs when x = y = M . IfM = 0, then the 
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choice of ^ = 0 satisfies that x(^) = y(^) = 0, making MSEfc(4) = g{x,y) the global minimum of 
MSEfc. If M 7 ^ 0, then we may choose ^ such that y{$) = = M. As > 0 must hold, 

it follows that there exists a G (0, oo) such that setting C = we obtain =)x(C) = yi$)- 

Since y(h) = pg^ is scale invariant, we obtain that x{C) = y{C) = y(^) = M, making C a global 
minimum of MSE^. In both cases, it follows that if b minimizes MSEfc(b), then b maximizes 

Ps„s{b) 

Erom equation Q, we see that maxbgC" 1^5^ g(b)l ~ ™axbg^/ 0 _ 5 ^ Thus if b is a minimizer 
of MSEfc(a;b), then b is also a maximizer of \pg^ 5 (b)I claimed. 

We now demonstrate that b is a maximizer of | pg^ | if and only if it is also a maximizer 

of SINRfc(b). Under the conventions that | = +oo when x > 0 and that g = soo where s = — I 
for maximization problems and s = +1 for minimization problems, the following problems have 
equivalent solution sets over choices of b: 


maxSINRfc(b) 


max ■ 




var(S'(b)) - E[|bAfcS'fc|' 


= max- 


iE[5fc5*(b)]r 


. var(,S(b))-|E[5fc5*(b)]| 

mm- K -= mm 

b |E[5fc5*(b)]| b 

|E[5fc5*(b)]|' _ , ,2 

max-X-= max Pc, 

b var(5(b)) b ''^5fc,5(b)i 


var(5(b))-|E[5fc5*(b)]r 
var(,S(b)) 


iE[5fc5*(b)]r 


In the above, the first equivalence is a rewriting of equation S- To see the second equivalence, 
we note that |E[5fc5*(b)]| = |E[5fc(bAS + bT 7 )*]|^ = |bAfc|^ using the independence of from all 
other terms. Then, noting that |bAfcp = E[|bAfcS'fcp] gives the equivalence. The fourth equivalence 
is only changing the problem by the additive constant —1. □ 
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